Student Name:

In [ ]:
import altair as alt
import pandas as pd

Transformations, Scales, and Axes¶

Today we will work on transforming data within the chart specification and changing scales and axes to suit our needs. This data set is sourced from the US News college database, and represents ~1,300 US colleges. It includes variables like admissions statistics, size, costs, and other aspects of the schools.

Be sure to read the data description before starting, as you will need to know the column names and data types to proceed.

Assignment Expectations¶

You are asked to create 5 total plots here. Each plot should follow the requirements and should be polished and production-quality (e.g. titles, axes labels, legend labels, reasonable scales, edited width and height, etc). Use your best judgement and take the time to make your plots look nice.

Some plots are accompanied by discussion questions. Be sure to answer all questions clearly in a code cell (or same cell) directly next to the prompt. Responses will be graded on thoroughness and concrete justifications that reference course concepts.

Data Description¶

Most variables are named in quite obvious ways. To save some space, here are descriptions for variables that are potentially unclear:

  • FICE (Federal ID Number)
  • Public/private indicator (public=1, private=2)
  • Average Math SAT score
  • Average Verbal SAT score
  • Average Combined SAT score
  • Average ACT score
  • First quartile - Math SAT
  • Third quartile - Math SAT
  • First quartile - Verbal SAT
  • Third quartile - Verbal SAT
  • First quartile - ACT
  • Third quartile - ACT
  • Number of applications received
  • Number of applicants accepted
  • Number of new students enrolled
  • Pct. new students from top 10% of H.S. class
  • Pct. new students from top 25% of H.S. class
  • Pct. of faculty with Ph.D.'s
  • Pct. of faculty with terminal degree
  • Student/faculty ratio
  • Pct.alumni who donate
  • Instructional expenditure per student
In [ ]:
dataurl = 'http://lib.stat.cmu.edu/datasets/colleges/usnews.data'
college = pd.read_csv(dataurl,
                       na_values='*',
                       names = ['FICE', 'College','State','Private','MathSAT','VerbalSAT','CombinedSAT','ACT',
                                'Q1Math','Q3Math','Q1Verbal','Q3Verbal','Q1ACT','Q3ACT',
                               'Applied','Accepted','Enrolled','Top10HS','Top25HS','FullTimeUG','PartTimeUG',
                               'InStateTuition','OutOfStateTuition','RoomBoardCost','RoomCost','BoardCost',
                               'ExtraFees','BookFees','PersonalFees',
                               'PhDFaculty','TerminalFaculty','StuFacRatio','AlumniDonate','SpendPerStudent','GradRate'])
college.head()
college["Top10HS"]
Out[ ]:
Top10HS
0 16.0
1 NaN
2 4.0
3 NaN
4 NaN
... ...
1297 NaN
1298 23.0
1299 28.0
1300 15.0
1301 23.0

1302 rows × 1 columns


Plot 1¶

Plot a histogram of in-state tuition. Edit the plot's scale, domain, and/or transformations to more clearly show a trend. Keep in mind that in-state tuition most likely applies only to public institutions, since private schools don't have different rates based on state residency. Include filters to account for this.

Write a clear statement that this plot shows, and justify your design choices. Thinking creatively, what else could we add to this plot to make it more compelling? You do not have to make the change, but be sure to be specific and justify your suggestion.

Most of schools has In-State tuition of $4000 or less.

The bars of histogram already have high data-ink ratio and space is crowded, so it's not ideal to introduce too many variables in the chart, so it would require to switch to different mark/chart.

In [ ]:
in_state_tuition_chart = alt.Chart(college).mark_bar().encode(
    x=alt.X("InStateTuition:Q", bin=alt.BinParams(maxbins=25)),
    y=alt.Y("count()")
).transform_filter(
    "datum.Private == 1"
).properties(
    title="In-State Tuition"
)

in_state_tuition_chart
Out[ ]:

Plot 2¶

Using the transform_calculate() function, create a new variable to compute the acceptance rate based on the number of applications and acceptances. Make a scatterplot of the percent of students in the top 10% of their high school class (y) vs the acceptance rate (x). Think about the expected relationship here, especially regarding acceptance rate.

What might people expect to see in this plot, in terms of following conventions of visualization? Edit your x-axis to account for this, and justify your change in words. Then, clearly write the statement this plot makes.

People will see most of schools have both high acceptance rate and lower percentage of students in the top 10% of their high school class.

One thing I changed after making my initial plot is that when I got the Accpetance Rate from datum.Accepted / datum.Applied the result is on range of [0.0, 1.0], and the Top10HS from the already in [0, 100]; since both of them is about percentage so I converted them to the same scale.

In [ ]:
alt.Chart(college).transform_calculate(
    acceptance_rate="(datum.Accepted / datum.Applied) * 100"
).mark_point(
    size=4
).encode(
    x=alt.X("acceptance_rate:Q", title="Acceptance Rate"),
    y=alt.Y("Top10HS:Q", title="Pct. of new students from top 10% of H.S. class")
).properties(
    title="Acceptance Rate vs Percentage of Students in the Top 10% of High School Class",
    width=300,
    height=300
)
Out[ ]:

Plot 3¶

Calculate a new variable, the revenue, estimated using the in-state tuition, room and board costs, and full-time undergraduate population. Make a heatmap where, by state, the sum of revenue is shown for different school sizes (binned). Change the color scheme to something representative of the theme, selecting a scheme from the Vega color scheme documentation. We will talk more about good color scheme selection later.

What one-sentence question can this plot answer?

How does the average revenue of schools relate to their enrollment size across states?

In [ ]:
alt.Chart(college).transform_calculate(
    revenue="(datum.InStateTuition + datum.RoomBoardCost) * datum.FullTimeUG"
).mark_rect().encode(
    x=alt.X("Enrolled:Q", bin=alt.BinParams(maxbins=25), title="Enrollement"),
    y=alt.Y("State:N"),
    color=alt.Color("sum(revenue):Q",
                    scale=alt.Scale(scheme="yellowgreen"),
                    legend=alt.Legend(title="Combined Revenue")
                   )
).transform_filter(
    "datum.Private == 1"
).properties(
    title="Combined Revenue by School Size in Different States",
    width=800,
    height=800
)
Out[ ]:

Plot 4¶

Make a scatterplot the graduation rate (y) vs student to faculty ratio (x). Change the scales and domains to a reasonable setting for the data. Then, assign the private/public status to a color encoding and choose a new pair of colors from this list of options, different from the default and the examples seen in class.

It may also be useful to use the following code, which transforms the values in the Private column from 1 and 2 to "private" and "public" to make your legend more clear.

transform_calculate(RenamePrivate = 'if(datum.Private ==1, "public", "private")')

Describe why you chose the scale and domain you did, and discuss advantages and disadvantages of this choice.

I capped the range for GradRates to 100 to filter out abnormal data, and StuFacRatio to between 2 and 30 to focus on the range with most of data points, with trade-off of ignoring 3 outlier. The result chart has a good aspect ratio with 45° mid-angle

In [ ]:
alt.Chart(college).mark_point().transform_calculate(
    RenamePrivate = 'if(datum.Private==1, "public", "private")'
).encode(
    x=alt.X("StuFacRatio:Q", scale=alt.Scale(domain=[2, 30])),
    y=alt.Y("GradRate:Q"),
    color=alt.Color("RenamePrivate:N", scale=alt.Scale(domain=["public", "private"]), legend=alt.Legend(title="Private/Public"))
).transform_filter(
    "datum.GradRate <= 100 && datum.StuFacRatio <= 30"
).properties(
    title="Graduation Rate vs Student to Faculty Ratio",
    width=600,
    height=600
)
Out[ ]:

Plot 5¶

This plot should be the most challenging. Using the transform_aggregate() function, compute the median CombinedSAT score in each state. Using the transform_filter() function, filter to 5 states of interest to you. If you get stuck, look at the Altair Filtering documentation page.

Use the two position encodings to show the median SAT score by state. Then, sort the states by the median scores (there should be no alphabetical order here!).

The choice of mark and scale is up to you. Discuss your choice and justify your reasoning.

I used y-axis to display state as the texts are displayed left-to-right for better readabilty; so for media SAT score the only option left is the x-position as length, and coincidently we usually prefer the top means higher value thus this chart works well.

Since the values are close to each other and I shouldn't alter the range for preserving expressiveness, I added the value as text for each bar to reduce cognitie load.

In [ ]:
median_combined_sat_base = alt.Chart(college).transform_aggregate(
    median_combined_sat="median(CombinedSAT)",
    groupby=["State"]
).transform_filter(
    "datum.State == 'WA' || datum.State == 'OR' || datum.State == 'CA' || datum.State == 'NV' || datum.State == 'ID'"
)

median_combined_sat_bar = median_combined_sat_base.mark_bar().encode(
    y=alt.Y("State:N", sort="-x"),
    x=alt.X("median_combined_sat:Q")
)

median_combined_sat_tip = median_combined_sat_base.mark_text(
    align="left",
    dx=3
).encode(
    y=alt.Y("State:N", sort="-x"),
    x=alt.X("median_combined_sat:Q"),
    text=alt.Text("median_combined_sat:Q")
)

(median_combined_sat_bar + median_combined_sat_tip).properties(
    title="Median Combined SAT Score in Western States",
    width=300,
    height=300
)
Out[ ]:

Compsition¶

Plot 1A & 1B¶

Make a histogram of binned Math SAT scores, differentiated by private/public using color. Using the repeat function, expand the plot to also show Verbal SAT scores and ACT scores.

What, if anything, does this say about the selectivity of private vs public schools?

Private schools likely requires higher test scores.

This series of charts is not very effective at examining the relationship of selectivity we mean to show.

Make a new plot that showcases this relationship more effectively. Discuss why your changes improve the effectivness of this plot. Hint: think about what the color encoding is trying to show. What makes it hard to compare across this data? What transformation, layering, or change in design could we use to better show this difference?

Comparing the count would be less effective as the count of private and public schools already different, but the median value and range and already tell alot about the question. So I make it a boxplot that can still tell the same story with less data-ink.

In [ ]:
binned_math_sat_score_chart = alt.Chart(college).transform_calculate(
    RenamePrivate = 'if(datum.Private==1, "public", "private")'
).mark_bar().encode(
    x=alt.X(alt.repeat("column"), bin=alt.BinParams(maxbins=30)),
    y=alt.Y("count()"),
    color=alt.Color("RenamePrivate:N")
).repeat(
    column=["MathSAT", "VerbalSAT", "ACT"]
)

binned_math_sat_score_chart
Out[ ]:
In [ ]:
alt.Chart(college).transform_calculate(
    RenamePrivate="if(datum.Private == 1, 'Public', 'Private')"
).mark_boxplot().encode(
    x=alt.X("RenamePrivate:N", title="Private/Public"),
    y=alt.Y(alt.repeat("column"), type="quantitative", scale=alt.Scale(zero=False)),
    color="RenamePrivate:N"
).repeat(
    column=["MathSAT", "VerbalSAT", "ACT"]
).properties(
    title="Standarization Test Score by Private/Public School",
)
Out[ ]:

Plot 2A & 2B¶

Create a scatterplot of spending per student vs in-state tuition costs. Filter down to just public schools in WA. Choose a third encoding channel to show the name of the college.

What patterns does this plot reveal? What additional layer could be used to highlight this pattern? Make a new plot with this additional layer. Is your choice of encoding for school effective? If not, make this change in the new plot as well.

This plot shows the Spending Per Student is exceed the in-state tuition; and schools with similar In-State Tuition has similar Spend Per Student.The use of shape/color as the third encoding is effective so far, but only because there's very few data points in this view.

So I made the new chart with text label as the third encoding. It should be more effectie as it reduce the need to looking up for legends. However the school names are long making the chart crowded, as well as if the data points close to each other will obstruct readings.

In [ ]:
alt.Chart(college).transform_filter(
    "datum.State == 'WA' && datum.Private == 1"
).mark_point().encode(
    x=alt.X("InStateTuition:Q"),
    y=alt.Y("SpendPerStudent:Q"),
    shape="College:N",
    color=alt.Color("College:N", legend=alt.Legend(title="College Name"))
).properties(
    title="In-State Tuition vs Spending Per Student",
)
Out[ ]:
In [ ]:
wa_tuition_vs_spend_chart_base = alt.Chart(college).transform_filter(
    "datum.State == 'WA' && datum.Private == 1"
).mark_point().encode(
    x=alt.X("InStateTuition:Q"),
    y=alt.Y("SpendPerStudent:Q")
)

wa_tuition_vs_spend_chart_text = alt.Chart(college).transform_filter(
    "datum.State == 'WA' && datum.Private == 1"
).mark_text(
    align="right",
    dx=-5
).encode(
    x=alt.X("InStateTuition:Q"),
    y=alt.Y("SpendPerStudent:Q"),
    text="College"
)

(wa_tuition_vs_spend_chart_base + wa_tuition_vs_spend_chart_text).properties(
    title="In-State Tuition vs Spending Per Student",
    width=400,
    height=600
)
Out[ ]: